Multivariate Normal Distribution

  • Random variable: \(\mathbf{y} \in \mathbb{R}^n\)
  • Mean vector: \(\mu \in \mathbb{R}^n\)
  • Variance-covariance matrix: \(\Sigma \in \mathbb{R}^{n \times n}\)
  • \(\Sigma_{11}\) is the variance of \(y_1\); \(\Sigma_{12} = \mathrm{Cov}(y_1, y_2)\)

\[ f(\mathbf{y}\vert\mu, \Sigma) = (2\pi)^{-n/2}\lvert\Sigma\rvert^{-1/2}\exp\left\{-\frac{1}{2}(\mathbf{y}-\mu)^T\Sigma^{-1}(\mathbf{y}-\mu)\right\}\]
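The density can be evaluated directly from this formula. A minimal base-R sketch (the function name `dmvn` is ours; the `mvtnorm` package provides `dmvnorm` for real work). With a diagonal \(\Sigma\) the density factors into independent univariate normals, which gives an easy check with `dnorm`:

```r
# Evaluate the multivariate normal density from the formula above.
dmvn <- function(y, mu, Sigma) {
  n <- length(y)
  quad <- t(y - mu) %*% solve(Sigma) %*% (y - mu)   # (y-mu)^T Sigma^{-1} (y-mu)
  as.numeric((2 * pi)^(-n / 2) * det(Sigma)^(-1 / 2) * exp(-quad / 2))
}

y     <- c(1, -0.5)
mu    <- c(0, 0)
Sigma <- diag(c(2, 3))                 # diagonal => independent components

dmvn(y, mu, Sigma)
prod(dnorm(y, mu, sqrt(diag(Sigma))))  # same value for diagonal Sigma
```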

Regression MLE estimate

  • Given a response \(y\) and predictors \(X\)
  • Assume \(\Sigma = \sigma^2 \mathbf{I}\)
  • Assume \(\mu = X\beta\) where \(\beta \in \mathbb{R}^p\) and \(X \in \mathbb{R}^{n \times p}\)
  • Find \(\beta\) that maximizes the likelihood \(L(\beta,\sigma^2\vert y) =\)

\[(2\pi)^{-n/2}\lvert\sigma^2 I\rvert^{-1/2}\exp\left\{-\frac{1}{2}(\mathbf{y}-X\beta)^T(\sigma^2 I)^{-1}(\mathbf{y}-X\beta)\right\}\]

  • Equivalent to maximizing the log-likelihood, \(\ell(\beta, \sigma^2\vert y)=\)

\[ -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2}(\mathbf{y}-X\beta)^T(\sigma^2 I)^{-1}(\mathbf{y}-X\beta) \]

MLE of \(\beta\)

\[ \begin{align*} \frac{\partial}{\partial\beta}\frac{-1}{2\sigma^2}(\mathbf{y} - X\beta)^T(\mathbf{y} - X\beta) &\overset{\text{set}}{=} 0\\ \frac{\partial}{\partial\beta}\left(\mathbf{y}^T\mathbf{y} - \mathbf{y}^TX\beta - \beta^TX^T\mathbf{y} + \beta^TX^TX\beta\right) &= 0 \\ \frac{\partial}{\partial\beta}\left(\mathbf{y}^T\mathbf{y} - 2\beta^TX^T\mathbf{y} + \beta^TX^TX\beta\right) &= 0 \\ -2X^T\mathbf{y} + 2X^TX\beta &= 0 \\ \text{Normal equations:}\quad X^TX\beta &= X^T\mathbf{y} \\ \widehat{\beta} &= (X^TX)^{-1}X^T\mathbf{y} \end{align*}\]
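The normal equations can be solved numerically and checked against `lm()`. A sketch on simulated toy data (the data here are made up; `solve(crossprod(X), ...)` is preferred to explicitly inverting \(X^TX\)):

```r
# Solve the normal equations X^T X beta = X^T y directly.
set.seed(1)
n <- 100
x <- runif(n)
y <- 2 + 3 * x + rnorm(n)

X <- cbind(1, x)                          # design matrix with intercept column
beta_hat <- solve(crossprod(X), crossprod(X, y))
beta_hat
coef(lm(y ~ x))                           # same estimates
```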

Linear Model Assumptions

Model: \[ Y = X\beta + \epsilon\]

\[\epsilon \sim N(0, \sigma^2 I)\]

  • Linearity
  • Constant variance
  • Normally distributed errors
  • Independent Errors

These assumptions let us derive the variance and sampling distribution of \(\widehat{\beta}\), which underlie the hypothesis tests reported in the model summary.
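In particular, under these assumptions \(\mathrm{Var}(\widehat{\beta}) = \sigma^2 (X^TX)^{-1}\). A sketch on simulated stand-in data, checking the formula against `vcov()`:

```r
# Var(beta_hat) = sigma^2 (X^T X)^{-1}; compare the formula with vcov().
set.seed(2)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

X  <- model.matrix(fit)
s2 <- summary(fit)$sigma^2          # estimate of sigma^2
s2 * solve(crossprod(X))            # matches vcov(fit)
```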

An easy example

ggplot(sim1, aes(y=Y, x = X1)) + geom_point()

Fit a linear model

fit1 <- lm(Y~X1, sim1)
fit1_summ <- summary(fit1)
fit1_summ
## 
## Call:
## lm(formula = Y ~ X1, data = sim1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.347 -1.321  0.016  1.260  5.774 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.81010    0.25652   10.96   <2e-16 ***
## X1           2.07021    0.04505   45.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.011 on 248 degrees of freedom
## Multiple R-squared:  0.8949, Adjusted R-squared:  0.8945 
## F-statistic:  2111 on 1 and 248 DF,  p-value: < 2.2e-16

Inspect Model Object

str(fit1_summ)
## List of 11
##  $ call         : language lm(formula = Y ~ X1, data = sim1)
##  $ terms        :Classes 'terms', 'formula'  language Y ~ X1
##   .. ..- attr(*, "variables")= language list(Y, X1)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "Y" "X1"
##   .. .. .. ..$ : chr "X1"
##   .. ..- attr(*, "term.labels")= chr "X1"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(Y, X1)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "Y" "X1"
##  $ residuals    : Named num [1:250] 0.27 -1.51 -3.264 -1.52 0.038 ...
##   ..- attr(*, "names")= chr [1:250] "1" "2" "3" "4" ...
##  $ coefficients : num [1:2, 1:4] 2.8101 2.0702 0.2565 0.0451 10.9547 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "X1"
##   .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
##  $ aliased      : Named logi [1:2] FALSE FALSE
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "X1"
##  $ sigma        : num 2.01
##  $ df           : int [1:3] 2 248 2
##  $ r.squared    : num 0.895
##  $ adj.r.squared: num 0.894
##  $ fstatistic   : Named num [1:3] 2111 1 248
##   ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
##  $ cov.unscaled : num [1:2, 1:2] 0.016266 -0.002481 -0.002481 0.000502
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "X1"
##   .. ..$ : chr [1:2] "(Intercept)" "X1"
##  - attr(*, "class")= chr "summary.lm"
sigmaHat <- fit1_summ$sigma
sigmaHat
## [1] 2.011301

Add predictions and residuals to dataframe

sim1_fit1 <- 
  sim1 %>% 
  add_predictions(fit1) %>% 
  add_residuals(fit1) %>%
  arrange(pred)

head(sim1_fit1)
## # A tibble: 6 x 4
##       Y     X1  pred  resid
##   <dbl>  <dbl> <dbl>  <dbl>
## 1  3.70 0.0950  3.01  0.689
## 2  4.68 0.137   3.09  1.59 
## 3  1.36 0.146   3.11 -1.75 
## 4  6.39 0.222   3.27  3.12 
## 5  8.08 0.267   3.36  4.72 
## 6  2.10 0.314   3.46 -1.36

Plot data with predictions

ggplot(sim1_fit1, aes(x = X1, y=Y)) +
  geom_point() + 
  geom_line(aes(y=pred), col="red")

Plot residuals versus predictions; the plot should show no visible trend.
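A sketch of that plot in base R (`sim1` itself is not reproduced here, so stand-in data of the same shape are simulated); `plot(fit1, 1)` draws the equivalent diagnostic from a fitted model:

```r
# Residuals versus fitted values; a flat, patternless cloud supports
# the linearity and constant-variance assumptions.
set.seed(3)
X1  <- runif(250, 0, 10)
Y   <- 3 + 2 * X1 + rnorm(250, sd = 2)   # stand-in for sim1
fit <- lm(Y ~ X1)

plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")
```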

Check for normality of residuals

sim1_fit1 %>%
  mutate(norm_pdf = dnorm(resid, 0, sd = sigmaHat)) %>%
ggplot(aes(x = resid, y = ..density..)) + 
  geom_histogram(bins = 30) +
  geom_line(aes(y = norm_pdf),col = "red")

It is easier to use a Q-Q plot

ggplot(sim1_fit1, aes(sample=resid)) + stat_qq() + stat_qq_line()

Leverage

Remember: \[\widehat{\beta} = (X^TX)^{-1}X^TY\] \[\widehat{Y} = X\widehat{\beta} = X(X^TX)^{-1}X^TY = HY\]

  • \(H=P_x\) is the “hat” matrix
  • \(H\) projects \(Y\) onto the column space of \(X\).
  • The predictions are a weighted sum of the observations.
  • Leverage of \(Y_i\) is defined as the \(i\)th diagonal element of \(H\).
  • If the leverage is large, \(Y_i\) carries a large weight in its own prediction \(\widehat{Y}_i\).
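Leverages can be computed directly as the diagonal of \(H\), or from a fitted model with `hatvalues()`. A sketch on simulated stand-in data with one deliberately extreme \(x\) value:

```r
# Leverages: diagonal of H = X (X^T X)^{-1} X^T, same as hatvalues().
set.seed(4)
x <- c(rnorm(49), 10)               # one far-out x value
y <- 1 + x + rnorm(50)
fit <- lm(y ~ x)

X <- model.matrix(fit)
H <- X %*% solve(crossprod(X)) %*% t(X)
all.equal(unname(diag(H)), unname(hatvalues(fit)))
which.max(hatvalues(fit))           # the outlying x has the largest leverage
```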

Cook’s distance

\[ D_i = \frac{1}{p}e_i^2 \frac{h_i}{1-h_i} \]

  • \(e_i\) is the \(i\)th standardized residual
  • \(h_i\) is the leverage of the \(i\)th point
  • \(p\) is the number of columns in \(X\) (including intercept)
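The formula can be checked against R's `cooks.distance()` using `rstandard()` for the standardized residuals. A sketch on simulated stand-in data:

```r
# Cook's distance from the formula: D_i = r_i^2 / p * h_i / (1 - h_i),
# with r_i the standardized residual and h_i the leverage.
set.seed(5)
x <- rnorm(40)
y <- 2 - x + rnorm(40)
fit <- lm(y ~ x)

h <- hatvalues(fit)
r <- rstandard(fit)                 # internally studentized residuals
p <- length(coef(fit))
D <- r^2 / p * h / (1 - h)
all.equal(unname(D), unname(cooks.distance(fit)))
```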

Plot

Observations with large values of Cook’s distance should be checked for undue influence on the fit.

plot(fit1,5)

Another Example

Model Fit

fit2 <- lm(Y~X1*X2, sim2)
summary(fit2)
## 
## Call:
## lm(formula = Y ~ X1 * X2, data = sim2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4288 -0.2779 -0.0079  0.3259  1.2693 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.80620    0.14240  12.684   <2e-16 ***
## X11          0.19380    0.20356   0.952    0.342    
## X2           0.56209    0.04704  11.949   <2e-16 ***
## X11:X2       0.93094    0.06622  14.059   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4905 on 246 degrees of freedom
## Multiple R-squared:  0.9341, Adjusted R-squared:  0.9333 
## F-statistic:  1163 on 3 and 246 DF,  p-value: < 2.2e-16

Plot predictions

sim2 %>%
add_predictions(fit2) %>%
  ggplot(aes(x=X2, y=Y, col=X1)) + 
  geom_point() +
  geom_line(aes(y = pred, group = X1), col="black")

Diagnostic plots

A harder example

Fit a model

fit_additive <- lm(Y ~ X1 + X2 + X3, sim3)
summary(fit_additive)
## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3, data = sim3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3232 -1.6434 -0.8110  0.6591 11.3232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.88475    0.43332    4.35 2.00e-05 ***
## X11          5.24671    0.33560   15.63  < 2e-16 ***
## X2           0.98406    0.18498    5.32 2.33e-07 ***
## X3           0.83892    0.05802   14.46  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.649 on 246 degrees of freedom
## Multiple R-squared:  0.6944, Adjusted R-squared:  0.6906 
## F-statistic: 186.3 on 3 and 246 DF,  p-value: < 2.2e-16

Diagnostic plots

Partial Residual Plots

  • For each predictor \(X_j\), compute the partial residuals \(e_i + \widehat{\beta}_j x_{ij}\)
  • Plot the partial residuals against \(x_{ij}\)
  • Curvature suggests \(X_j\) should enter the model non-linearly
car::crPlots(fit_additive)
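A component+residual (partial residual) plot can be built by hand; `crPlots()` automates this for every predictor. A sketch on simulated stand-in data where one predictor is deliberately mis-specified as linear:

```r
# Partial residuals e_i + beta_j * x_ij plotted against x_ij;
# curvature reveals the missing quadratic term in x2.
set.seed(6)
x1 <- rnorm(100)
x2 <- runif(100, 0, 4)
y  <- 1 + x1 + x2^2 + rnorm(100, sd = 0.5)
fit <- lm(y ~ x1 + x2)              # x2 mis-specified as linear

pr <- resid(fit) + coef(fit)["x2"] * x2
plot(x2, pr, ylab = "Partial residual for x2")   # curved: try a polynomial
```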

Fit Polynomial

fit_poly <- lm(Y ~ X1 + X2 + I(X3^2), sim3)
summary(fit_poly)
## 
## Call:
## lm(formula = Y ~ X1 + X2 + I(X3^2), data = sim3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2064 -0.4643 -0.0155  0.4660  1.6059 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.981588   0.120247   8.163  1.7e-14 ***
## X11         5.149831   0.092545  55.647  < 2e-16 ***
## X2          0.969575   0.049849  19.450  < 2e-16 ***
## I(X3^2)     0.202060   0.002669  75.716  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7307 on 246 degrees of freedom
## Multiple R-squared:  0.9767, Adjusted R-squared:  0.9765 
## F-statistic:  3443 on 3 and 246 DF,  p-value: < 2.2e-16

Diagnostic plots

Check for interactions

fit_int <- lm(Y ~ X1*X2 + X1*X3 + I(X3^2), sim3)
summary(fit_int)
## 
## Call:
## lm(formula = Y ~ X1 * X2 + X1 * X3 + I(X3^2), data = sim3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.55240 -0.34312 -0.02283  0.34847  1.48022 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.948587   0.109198  17.844   <2e-16 ***
## X11          2.911424   0.164145  17.737   <2e-16 ***
## X2           0.504723   0.049244  10.249   <2e-16 ***
## X3          -0.003321   0.019841  -0.167    0.867    
## I(X3^2)      0.201514   0.002692  74.858   <2e-16 ***
## X11:X2       1.074459   0.074775  14.369   <2e-16 ***
## X11:X3      -0.009878   0.023299  -0.424    0.672    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5306 on 243 degrees of freedom
## Multiple R-squared:  0.9879, Adjusted R-squared:  0.9876 
## F-statistic:  3302 on 6 and 243 DF,  p-value: < 2.2e-16

Check for interactions

fit_int2 <- lm(Y ~ X1*X2 + X3 + I(X3^2), sim3)
summary(fit_int2)
## 
## Call:
## lm(formula = Y ~ X1 * X2 + X3 + I(X3^2), data = sim3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.54085 -0.33631 -0.03189  0.34924  1.51108 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.950783   0.108892   17.91   <2e-16 ***
## X11          2.911575   0.163869   17.77   <2e-16 ***
## X2           0.508864   0.048184   10.56   <2e-16 ***
## X3          -0.008216   0.016109   -0.51    0.611    
## I(X3^2)      0.201474   0.002686   75.02   <2e-16 ***
## X11:X2       1.064843   0.071132   14.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5297 on 244 degrees of freedom
## Multiple R-squared:  0.9879, Adjusted R-squared:  0.9876 
## F-statistic:  3976 on 5 and 244 DF,  p-value: < 2.2e-16

Compare Nested Models

anova(fit_int, fit_int2)
## Analysis of Variance Table
## 
## Model 1: Y ~ X1 * X2 + X1 * X3 + I(X3^2)
## Model 2: Y ~ X1 * X2 + X3 + I(X3^2)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    243 68.407                           
## 2    244 68.458 -1 -0.050603 0.1798  0.672

Plot predictions

Another Example

Fit a model

fit4 <- lm(Y ~ X1 + X2 + X3, sim4)
summary(fit4)
## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3, data = sim4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.860 -2.713 -1.036  1.470 40.663 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10.0825     2.7538  -3.661 0.000307 ***
## X11          -3.5548     0.5880  -6.045 5.51e-09 ***
## X2            4.0668     1.4081   2.888 0.004219 ** 
## X3            8.0835     0.2953  27.374  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.609 on 246 degrees of freedom
## Multiple R-squared:  0.7861, Adjusted R-squared:  0.7835 
## F-statistic: 301.4 on 3 and 246 DF,  p-value: < 2.2e-16

Diagnostic plots

Partial Residual Plots

crPlots(fit4)

Try a polynomial

fitq <- lm(Y ~ X1 + X2 + X3 + I(X3^2))
summary(fitq)
## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + I(X3^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6841 -0.9499  0.1055  1.1619 19.5664 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.3491     1.5197   0.230    0.818    
## X1           -2.8342     0.3134  -9.042  < 2e-16 ***
## X2            3.3779     0.7479   4.517 9.76e-06 ***
## X3           -3.5428     0.4896  -7.236 5.92e-12 ***
## I(X3^2)       2.7291     0.1089  25.065  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.446 on 245 degrees of freedom
## Multiple R-squared:   0.94,  Adjusted R-squared:  0.939 
## F-statistic: 959.5 on 4 and 245 DF,  p-value: < 2.2e-16

Diagnostic plots

Transform Response

fitlog <- lm(log(Y) ~ X1 + X2 + X3, sim4)
summary(fitlog)
## 
## Call:
## lm(formula = log(Y) ~ X1 + X2 + X3, data = sim4)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.263136 -0.077265  0.003242  0.073860  0.303209 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.505105   0.062164   8.125 2.17e-14 ***
## X11         -0.206954   0.013275 -15.590  < 2e-16 ***
## X2           0.352788   0.031785  11.099  < 2e-16 ***
## X3           0.596027   0.006666  89.413  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.104 on 246 degrees of freedom
## Multiple R-squared:  0.9752, Adjusted R-squared:  0.9749 
## F-statistic:  3227 on 3 and 246 DF,  p-value: < 2.2e-16

Diagnostic plots

Partial Residual Plots

crPlots(fitlog)

Add interactions

fitlog2 <- lm(log(Y) ~ X1*X2 + X1*X3, sim4)
summary(fitlog2)
## 
## Call:
## lm(formula = log(Y) ~ X1 * X2 + X1 * X3, data = sim4)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.248117 -0.066699 -0.000952  0.065165  0.227754 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.150112   0.084431   1.778   0.0767 .  
## X11          0.463198   0.115079   4.025 7.61e-05 ***
## X2           0.541764   0.042982  12.604  < 2e-16 ***
## X3           0.583208   0.008589  67.901  < 2e-16 ***
## X11:X2      -0.364864   0.059559  -6.126 3.58e-09 ***
## X11:X3       0.029539   0.012479   2.367   0.0187 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0972 on 244 degrees of freedom
## Multiple R-squared:  0.9785, Adjusted R-squared:  0.9781 
## F-statistic:  2226 on 5 and 244 DF,  p-value: < 2.2e-16

Plot predictions

sim4 %>%
  add_predictions(fitlog2) %>%
  ggplot(aes(x=pred, y = log(Y))) + geom_point() +
  geom_abline(slope = 1, intercept = 0)

What if a transformation isn’t enough?

Bootstrap the model coefficient CIs

fit5 <- lm(Y~X1 + X2 + X3, sim5)

model_lm <- 
  sim5 %>% 
  modelr::bootstrap(1000) %>%
  mutate(model = map(strap, ~lm(Y ~ X1 + X2 + X3, data=.)))

Extract Coefficients

rq_tidy <- function(fit) {
  df <- data.frame(term = names(coef(fit)), estimate = coef(fit))
  row.names(df) <- NULL
  return(df)
}

param_lm <- 
  model_lm %>%
  mutate(param = map(model, rq_tidy)) %>%
  select(.id, param) %>% 
  unnest() 

Results

param_lm  %>%
  group_by(term) %>%
  summarise(
    mean = mean(estimate), 
    q025 = quantile(estimate, .025), 
    q975 = quantile(estimate, .975)
  )
## # A tibble: 4 x 4
##   term           mean   q025  q975
##   <fct>         <dbl>  <dbl> <dbl>
## 1 (Intercept)  3.63    1.67  5.73 
## 2 X11         -0.0612 -0.477 0.367
## 3 X2          -0.435  -1.56  0.648
## 4 X3           0.724   0.476 0.975
confint.lm(fit5)
##                  2.5 %    97.5 %
## (Intercept)  1.3162078 5.8691055
## X11         -0.4968357 0.3670706
## X2          -1.5663352 0.7413726
## X3           0.5126871 0.9372699